Introduction

Course structure

Presentations built around content developed for the Epidemiologist R handbook

Capstone project

- To apply R skills to data you regularly use

- Present on the challenges and enablers

Why R?

Why Learn R? 10 Handy Reasons to Learn R programming Language

Installing R

For this course, you will need to install 2 items:

  1. R programming language

  2. R Studio

    • Integrated Development Environment (IDE)

    • A very helpful resource for writing and running R code

    You will need to install them in this order - First R, then R studio

Guide to installing R

Setting up files and folders

“Massive Wall of Organized Documents” by Zeusandhera is licensed with CC BY-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/2.0/

Setting up files and folders - Best practice

Setting up files and folders will make your analysis (and life!) easier

Setting up files and folders - Folder structure

R Studio works best when you use its project function

Projects are covered in more detail in Chapter 6 “R Projects”

Setting up files and folders - Naming files and folders

If you want to share your code with colleagues or when you return to code after several weeks/months you will be grateful that you gave your files and folders meaningful names!

Many organisations have style guides to ensure that teams can collaborate on coding projects

Tidyverse Style Guide - 1.2 Organisation

R packages

What is an R package?

An R package is a collection of functions which you can use to import, clean, analyse and report your data

Link to Epidemiologist R handbook - 3.7 “Packages

Packages can simplify your workflow by combining multiple steps into a smaller number of commands

Example: readxl is a package of functions used to import data from Excel to R.

R packages - Installing a package

install.packages(“readxl”)

We have asked R to install the package “readxl”.

The installation has been successful. The installation only has to take place once. You do not need to re install the packages every time you start a new project as they are saved in your library.

R packages - Loading a package

Now that readxl has been installed, you will be able to load it and use its functions

library(readxl)

When the package has been successfully loaded, you will see a tick mark in the box.

R packages - Using a package

Each package has multiple functions that you can use on your data.

To read more about a particular package type

?readxl

Help documentation for readxl

For this example, we want to import data that is currently stored in an Excel formatted file “.xlsx”

So we can use the function read_xlsx from the readxl package

read_xlsx(here('data','AfricaCovid','AfricaCovid.xlsx'))
## New names:
## * `` -> ...1
## # A tibble: 58 x 2
##    ...1                     `www.hera-ngo.org`
##    <chr>                    <chr>             
##  1 <NA>                     org.hera@gmail.com
##  2 Country                  Last update       
##  3 Algeria                  44318             
##  4 Angola                   44318             
##  5 Benin                    44317             
##  6 Botswana                 44317             
##  7 Burkina Faso             44317             
##  8 Burundi                  44318             
##  9 Cameroon                 44317             
## 10 Central African Republic 44317             
## # … with 48 more rows

But what does this show? And how can we use it?

So when we tell R to use the function read_xlsx, it reads the first sheet which is called “ReadMore”.

It looks like this is a summary sheet with information about when data for each country was last updated.

So how do we tell R to read in a different sheet from the Excel file?

R packages - Using a package 4

Question - How many confirmed cases of COVID were recorded across Africa in July 2020?

First step - Import data from the sheet containing information on COVID cases

We can use the excel_sheets function from readxl to get the names of all sheets in the Excel workbook

excel_sheets(here('data','AfricaCovid','AfricaCovid.xlsx'))
##  [1] "ReadMore"             "Infected_per_day"     "Recovered_per_day"   
##  [4] "Deceased_per_day"     "Cumulative_infected"  "Cumulative_recovered"
##  [7] "Cumulative_deceased"  "SDN FLore"            "GHA Flore"           
## [10] "SLE Flore"            "ZAF Flore"

From this list we can see that we want to import data from the sheet “Infected_per_day”.

read_xlsx(here('data','AfricaCovid','AfricaCovid.xlsx'), sheet="Infected_per_day")
## # A tibble: 53 x 492
##    ISO   COUNTRY_NAME     AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
##    <chr> <chr>            <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 DZA   Algeria          Northern Afri…       0       0       0       0       0
##  2 AGO   Angola           Southern Afri…       0       0       0       0       0
##  3 BEN   Benin            Western Africa       0       0       0       0       0
##  4 BWA   Botswana         Southern Afri…       0       0       0       0       0
##  5 BFA   Burkina Faso     Western Africa       0       0       0       0       0
##  6 BDI   Burundi          Central Africa       0       0       0       0       0
##  7 CMR   Cameroon         Central Africa       0       0       0       0       0
##  8 CAR   Central African… Central Africa       0       0       0       0       0
##  9 TCD   Chad             Central Africa       0       0       0       0       0
## 10 COM   Comoros          Eastern Africa       0       0       0       0       0
## # … with 43 more rows, and 484 more variables: 43836 <dbl>, 43837 <dbl>,
## #   43838 <dbl>, 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>,
## #   43843 <dbl>, 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>,
## #   43848 <dbl>, 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>,
## #   43853 <dbl>, 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>,
## #   43858 <dbl>, 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>,
## #   43863 <dbl>, 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>,
## #   43868 <dbl>, 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>,
## #   43873 <dbl>, 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>,
## #   43878 <dbl>, 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>,
## #   43883 <dbl>, 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>,
## #   43888 <dbl>, 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>,
## #   43893 <dbl>, 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>,
## #   43898 <dbl>, 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>,
## #   43903 <dbl>, 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>,
## #   43908 <dbl>, 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>,
## #   43913 <dbl>, 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>,
## #   43918 <dbl>, 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>,
## #   43923 <dbl>, 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>,
## #   43928 <dbl>, 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>,
## #   43933 <dbl>, 43934 <dbl>, 43935 <dbl>, …

We can see a snapshot of the data from the sheet “Infected_per_day”

Objects - In R, everything is an object

So far we have installed, loaded and used a package (readxl)

But how do we use the information generated from these actions?

We assign the information to “objects”

Section in Epidemiologist for R handbook about Objects

“Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - are objects which are assigned a name and can be referenced in later commands.”

Objects - Example

To explain objects, we will calculate a value, assign it to an object and then use the object for a second calculation.

2+2
## [1] 4

We can assign the calculation “2+2” to an object called “a”

a <- 2+2

We can then use the object a to show the results of the calculation

a
## [1] 4

We can also use this value for further calculations such as adding 4 to the object a

a + 4
## [1] 8
b <- a+4

The result of this calculation is now stored in the object “b”

b
## [1] 8

Objects: Assigning our data to an object

In the previous section, we used the function read_xlsx from the package readxl to import data from an Excel spreadsheet.

But we didn’t assign this to an object, so it is not possible to use the data from the import step.

We can assign the data to an object and then conduct further analysis.

africa_covid_cases <- read_xlsx(here('data','AfricaCovid','AfricaCovid.xlsx'), sheet="Infected_per_day")

You will now see the object in the “Environment” section of R Studio.

Now the data have been assigned to the object “africa_covid_cases”, we can start to work with the data.

Working with data

In the africa_covid_cases object, there are 53 obs (observations) of 492 variables.

So what does this mean?

We can look at our data to get more information

africa_covid_cases
## # A tibble: 53 x 492
##    ISO   COUNTRY_NAME     AFRICAN_REGION `43831` `43832` `43833` `43834` `43835`
##    <chr> <chr>            <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 DZA   Algeria          Northern Afri…       0       0       0       0       0
##  2 AGO   Angola           Southern Afri…       0       0       0       0       0
##  3 BEN   Benin            Western Africa       0       0       0       0       0
##  4 BWA   Botswana         Southern Afri…       0       0       0       0       0
##  5 BFA   Burkina Faso     Western Africa       0       0       0       0       0
##  6 BDI   Burundi          Central Africa       0       0       0       0       0
##  7 CMR   Cameroon         Central Africa       0       0       0       0       0
##  8 CAR   Central African… Central Africa       0       0       0       0       0
##  9 TCD   Chad             Central Africa       0       0       0       0       0
## 10 COM   Comoros          Eastern Africa       0       0       0       0       0
## # … with 43 more rows, and 484 more variables: 43836 <dbl>, 43837 <dbl>,
## #   43838 <dbl>, 43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>,
## #   43843 <dbl>, 43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>,
## #   43848 <dbl>, 43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>,
## #   43853 <dbl>, 43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>,
## #   43858 <dbl>, 43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>,
## #   43863 <dbl>, 43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>,
## #   43868 <dbl>, 43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>,
## #   43873 <dbl>, 43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>,
## #   43878 <dbl>, 43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>,
## #   43883 <dbl>, 43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>,
## #   43888 <dbl>, 43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>,
## #   43893 <dbl>, 43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>,
## #   43898 <dbl>, 43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>,
## #   43903 <dbl>, 43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>,
## #   43908 <dbl>, 43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>,
## #   43913 <dbl>, 43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>,
## #   43918 <dbl>, 43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>,
## #   43923 <dbl>, 43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>,
## #   43928 <dbl>, 43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>,
## #   43933 <dbl>, 43934 <dbl>, 43935 <dbl>, …

ISO - 3 letter code assigned to each country

COUNTRY_NAME - Name of the country

AFRICAN_REGION - African region

43831, 43832, 43833 - This looks like a date format used by Excel. It is the number of days since January 1, 1970.

Working with data - other ways to look at data

Show the first 5 rows

head(africa_covid_cases, n=5)
## # A tibble: 5 x 492
##   ISO   COUNTRY_NAME AFRICAN_REGION  `43831` `43832` `43833` `43834` `43835`
##   <chr> <chr>        <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 DZA   Algeria      Northern Africa       0       0       0       0       0
## 2 AGO   Angola       Southern Africa       0       0       0       0       0
## 3 BEN   Benin        Western Africa        0       0       0       0       0
## 4 BWA   Botswana     Southern Africa       0       0       0       0       0
## 5 BFA   Burkina Faso Western Africa        0       0       0       0       0
## # … with 484 more variables: 43836 <dbl>, 43837 <dbl>, 43838 <dbl>,
## #   43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>, 43843 <dbl>,
## #   43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>, 43848 <dbl>,
## #   43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>, 43853 <dbl>,
## #   43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>, 43858 <dbl>,
## #   43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>, 43863 <dbl>,
## #   43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>, 43868 <dbl>,
## #   43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>, 43873 <dbl>,
## #   43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>, 43878 <dbl>,
## #   43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>, 43883 <dbl>,
## #   43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>, 43888 <dbl>,
## #   43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>, 43893 <dbl>,
## #   43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>, 43898 <dbl>,
## #   43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>, 43903 <dbl>,
## #   43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>, 43908 <dbl>,
## #   43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>, 43913 <dbl>,
## #   43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>, 43918 <dbl>,
## #   43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>, 43923 <dbl>,
## #   43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>, 43928 <dbl>,
## #   43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>, 43933 <dbl>,
## #   43934 <dbl>, 43935 <dbl>, …

Show the last 7 rows

tail(africa_covid_cases, n=7)
## # A tibble: 7 x 492
##   ISO   COUNTRY_NAME AFRICAN_REGION  `43831` `43832` `43833` `43834` `43835`
##   <chr> <chr>        <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 SDN   Sudan        Eastern Africa        0       0       0       0       0
## 2 TZA   Tanzania     Eastern Africa        0       0       0       0       0
## 3 TGO   Togo         Western Africa        0       0       0       0       0
## 4 TUN   Tunisia      Northern Africa       0       0       0       0       0
## 5 UGA   Uganda       Eastern Africa        0       0       0       0       0
## 6 ZMB   Zambia       Southern Africa       0       0       0       0       0
## 7 ZWE   Zimbabwe     Southern Africa       0       0       0       0       0
## # … with 484 more variables: 43836 <dbl>, 43837 <dbl>, 43838 <dbl>,
## #   43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>, 43843 <dbl>,
## #   43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>, 43848 <dbl>,
## #   43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>, 43853 <dbl>,
## #   43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>, 43858 <dbl>,
## #   43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>, 43863 <dbl>,
## #   43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>, 43868 <dbl>,
## #   43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>, 43873 <dbl>,
## #   43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>, 43878 <dbl>,
## #   43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>, 43883 <dbl>,
## #   43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>, 43888 <dbl>,
## #   43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>, 43893 <dbl>,
## #   43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>, 43898 <dbl>,
## #   43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>, 43903 <dbl>,
## #   43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>, 43908 <dbl>,
## #   43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>, 43913 <dbl>,
## #   43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>, 43918 <dbl>,
## #   43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>, 43923 <dbl>,
## #   43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>, 43928 <dbl>,
## #   43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>, 43933 <dbl>,
## #   43934 <dbl>, 43935 <dbl>, …

How many unique countries are in the data?

unique(africa_covid_cases$COUNTRY_NAME)
##  [1] "Algeria"                          "Angola"                          
##  [3] "Benin"                            "Botswana"                        
##  [5] "Burkina Faso"                     "Burundi"                         
##  [7] "Cameroon"                         "Central African Republic"        
##  [9] "Chad"                             "Comoros"                         
## [11] "Congo"                            "Cote d'Ivoire"                   
## [13] "Democratic Republic of the Congo" "Djibouti"                        
## [15] "Egypt"                            "Equatorial Guinea"               
## [17] "Eritrea"                          "Eswatini"                        
## [19] "Ethiopia"                         "Gabon"                           
## [21] "Gambia"                           "Ghana"                           
## [23] "Guinea"                           "Guinea-Bissau"                   
## [25] "Kenya"                            "Lesotho"                         
## [27] "Liberia"                          "Libya"                           
## [29] "Madagascar"                       "Malawi"                          
## [31] "Mali"                             "Mauritania"                      
## [33] "Mauritius"                        "Mayotte"                         
## [35] "Morocco"                          "Mozambique"                      
## [37] "Namibia"                          "Niger"                           
## [39] "Nigeria"                          "Rwanda"                          
## [41] "Sao Tome and Principe"            "Senegal"                         
## [43] "Sierra Leone"                     "Somalia"                         
## [45] "South Africa"                     "South Sudan"                     
## [47] "Sudan"                            "Tanzania"                        
## [49] "Togo"                             "Tunisia"                         
## [51] "Uganda"                           "Zambia"                          
## [53] "Zimbabwe"

There are 53 unique country values. This is helpful as there are also 53 rows so we can say that each row represents a country.

We can assign the list of unique countries to an object for future reference

country_list <- unique(africa_covid_cases$COUNTRY_NAME)

Working with data - looking at one variable

In the previous step, the following command was used

unique(africa_covid_cases$COUNTRY_NAME)

What does “$” do in R?

It allows us to look at a specific variable within the dataset

unique(africa_covid_cases$AFRICAN_REGION)
## [1] "Northern Africa" "Southern Africa" "Western Africa"  "Central Africa" 
## [5] "Eastern Africa"

And again we can assign this to an object

region_list <- unique(africa_covid_cases$AFRICAN_REGION)

The tidyverse

When using R, there are many approaches you can use to reach the same result.

There are thousands of packages with many functions and sometimes these packages can overlap.

This can be confusing when you are starting to learn R.

There is a collection of packages with many of the most commonly used packages and this is called the tidyverse.

tidyverse::tidyverse_packages()
##  [1] "broom"         "cli"           "crayon"        "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"  
##  [9] "googlesheets4" "ggplot2"       "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "readr"        
## [21] "readxl"        "reprex"        "rlang"         "rstudioapi"   
## [25] "rvest"         "stringr"       "tibble"        "tidyr"        
## [29] "xml2"          "tidyverse"

We will use functions from some of these packages over the next few sessions.

The tidyverse: Tidy data

The key concept when working with packages from the tidyverse is the concept of “tidy data”.

R for Epidemiologist handbook 4.1 From Excel - Tidy data

Principles of “tidy data”:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

The tidyverse: Why is this important?

Functions from the tidyverse packages are set up to work with tidy data.

If your data are not tidy, then you will have to restructure the data to a tidy format.

Restructuring can take a lot of time if the data are stored in Excel spreadsheets with a lot of formatting/merged columns.

Tidy data for efficiency, reproducibility, and collaboration. By Julie Lowndes and Allison Horst.

The tidyverse: Checking if data are tidy

In a previous step, we imported COVID case data from an Excel spreadsheet.
But how do we know if the data are “tidy”

Remember there are 3 principles

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell
head(africa_covid_cases, n=3)
## # A tibble: 3 x 492
##   ISO   COUNTRY_NAME AFRICAN_REGION  `43831` `43832` `43833` `43834` `43835`
##   <chr> <chr>        <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 DZA   Algeria      Northern Africa       0       0       0       0       0
## 2 AGO   Angola       Southern Africa       0       0       0       0       0
## 3 BEN   Benin        Western Africa        0       0       0       0       0
## # … with 484 more variables: 43836 <dbl>, 43837 <dbl>, 43838 <dbl>,
## #   43839 <dbl>, 43840 <dbl>, 43841 <dbl>, 43842 <dbl>, 43843 <dbl>,
## #   43844 <dbl>, 43845 <dbl>, 43846 <dbl>, 43847 <dbl>, 43848 <dbl>,
## #   43849 <dbl>, 43850 <dbl>, 43851 <dbl>, 43852 <dbl>, 43853 <dbl>,
## #   43854 <dbl>, 43855 <dbl>, 43856 <dbl>, 43857 <dbl>, 43858 <dbl>,
## #   43859 <dbl>, 43860 <dbl>, 43861 <dbl>, 43862 <dbl>, 43863 <dbl>,
## #   43864 <dbl>, 43865 <dbl>, 43866 <dbl>, 43867 <dbl>, 43868 <dbl>,
## #   43869 <dbl>, 43870 <dbl>, 43871 <dbl>, 43872 <dbl>, 43873 <dbl>,
## #   43874 <dbl>, 43875 <dbl>, 43876 <dbl>, 43877 <dbl>, 43878 <dbl>,
## #   43879 <dbl>, 43880 <dbl>, 43881 <dbl>, 43882 <dbl>, 43883 <dbl>,
## #   43884 <dbl>, 43885 <dbl>, 43886 <dbl>, 43887 <dbl>, 43888 <dbl>,
## #   43889 <dbl>, 43890 <dbl>, 43891 <dbl>, 43892 <dbl>, 43893 <dbl>,
## #   43894 <dbl>, 43895 <dbl>, 43896 <dbl>, 43897 <dbl>, 43898 <dbl>,
## #   43899 <dbl>, 43900 <dbl>, 43901 <dbl>, 43902 <dbl>, 43903 <dbl>,
## #   43904 <dbl>, 43905 <dbl>, 43906 <dbl>, 43907 <dbl>, 43908 <dbl>,
## #   43909 <dbl>, 43910 <dbl>, 43911 <dbl>, 43912 <dbl>, 43913 <dbl>,
## #   43914 <dbl>, 43915 <dbl>, 43916 <dbl>, 43917 <dbl>, 43918 <dbl>,
## #   43919 <dbl>, 43920 <dbl>, 43921 <dbl>, 43922 <dbl>, 43923 <dbl>,
## #   43924 <dbl>, 43925 <dbl>, 43926 <dbl>, 43927 <dbl>, 43928 <dbl>,
## #   43929 <dbl>, 43930 <dbl>, 43931 <dbl>, 43932 <dbl>, 43933 <dbl>,
## #   43934 <dbl>, 43935 <dbl>, …

So are the data “tidy”?

The data from the spreadsheet are not “tidy”.

The columns “43831, 43832, 43833…” represent different dates. Therefore, this does meet the second argument of “tidy data” - “Each observation must have its own row”.

But we can fix this using functions from the packages included in the tidyverse

Remember, first we must install the packages from the tidyverse

install.packages("tidyverse")

The tidyverse: Tidying data

Now that the tidyverse has been installed, we can use the functions from the packages to “tidy” the data.

One package which is very helpful for this is called “tidyr

Instead of loading individual packages, we can load the core tidyverse packages with one command

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The core packages contain powerful functions we can use to process, analyse and visualise data.

Remember to look at the documentation for a package type “?[name of package]”

Example -

?tidyr

To look at the functions within a package, type [name of package]::

Example

tidyr::

To reformat the data to a tidy format, we need to transform the data from wide to long.

The Epidemiologist R handbook has an excellent section describing how to do this

12 - Pivoting data

The tidyverse: Wide to long

From the Epidemiologist R handbook

africa_covid_cases_long <- africa_covid_cases %>% 
  pivot_longer(cols=4:492, names_to="excel_date", values_to="cases")

Transforming data from wide to long usually requires a few attempts to ensure you have the correct outcome!

head(africa_covid_cases_long, n=3)
## # A tibble: 3 x 5
##   ISO   COUNTRY_NAME AFRICAN_REGION  excel_date cases
##   <chr> <chr>        <chr>           <chr>      <dbl>
## 1 DZA   Algeria      Northern Africa 43831          0
## 2 DZA   Algeria      Northern Africa 43832          0
## 3 DZA   Algeria      Northern Africa 43833          0

This looks correct!

You can add comments to code to show other people (and remind yourself!) why you wrote the code in a particular way

africa_covid_cases_long <-
  africa_covid_cases %>% #tell R to use this dataset
  pivot_longer(cols = 4:492,#select the columns you want
               names_to = "excel_date", #name the new date column
               values_to = "cases") #name the new cases column

Working with dates

To add to the confusion, Excel has 2 additional date systems:

  1. 1900 date system

  2. 1904 date system

In the data set we are using, the dates are in this format:

head(africa_covid_cases_long$excel_date)
## [1] "43831" "43832" "43833" "43834" "43835" "43836"

We can use a function from another package to convert this to a standard date format.

install.packages("janitor")
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

The package janitor has many helpful functions for cleaning data

africa_covid_cases_long <- africa_covid_cases_long %>% 
  mutate(date_format=excel_numeric_to_date(as.numeric(excel_date)))

head(africa_covid_cases_long$date_format)
## [1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"
## [6] "2020-01-06"

The new variable created “date_format” is in the format YEAR-MONTH-DATE.

We can also check if the values in the new variable look correct

min(africa_covid_cases_long$date_format) #minimum date
## [1] "2020-01-01"
max(africa_covid_cases_long$date_format) #maximum date
## [1] "2021-05-03"

We know this is a data set of COVID cases so the date range (from the start of 2020 through to May of 2021) looks to be correct.

Your analysis dataset

We have created 2 data objects - africa_covid_cases and africa_covid_cases_long

Before analysing the data, it is a good idea to generate a new dataset which only contains the variables you need to analyse.

So what variables do we have in africa_covid_cases_long

names(africa_covid_cases_long)
## [1] "ISO"            "COUNTRY_NAME"   "AFRICAN_REGION" "excel_date"    
## [5] "cases"          "date_format"

We can select the variables we want to keep using the select function from the dplyr package

dplyr is a core part of the tidyverse so it is loaded when you write library(tidyverse)

analysis_dataset <- africa_covid_cases_long %>% 
  select(date_format,AFRICAN_REGION, COUNTRY_NAME, cases)

We can look at the first few rows of the dataset we have created to check we have selected the desired variables.

head(analysis_dataset)
## # A tibble: 6 x 4
##   date_format AFRICAN_REGION  COUNTRY_NAME cases
##   <date>      <chr>           <chr>        <dbl>
## 1 2020-01-01  Northern Africa Algeria          0
## 2 2020-01-02  Northern Africa Algeria          0
## 3 2020-01-03  Northern Africa Algeria          0
## 4 2020-01-04  Northern Africa Algeria          0
## 5 2020-01-05  Northern Africa Algeria          0
## 6 2020-01-06  Northern Africa Algeria          0

The select function from the dplyr package is very useful.

It can also be used to rename selected variables

analysis_dataset <- africa_covid_cases_long %>% 
  select(date=date_format,region=AFRICAN_REGION, country=COUNTRY_NAME, cases)

We have renamed AFRICAN_REGION and COUNTRY_NAME as region and country

head(analysis_dataset)
## # A tibble: 6 x 4
##   date       region          country cases
##   <date>     <chr>           <chr>   <dbl>
## 1 2020-01-01 Northern Africa Algeria     0
## 2 2020-01-02 Northern Africa Algeria     0
## 3 2020-01-03 Northern Africa Algeria     0
## 4 2020-01-04 Northern Africa Algeria     0
## 5 2020-01-05 Northern Africa Algeria     0
## 6 2020-01-06 Northern Africa Algeria     0

Answering questions with data 1

So far we have:

Now we can start to use the dataset to answer questions

How many confirmed cases of COVID-19 have been recorded in Africa?

analysis_dataset %>% 
    summarise(total_covid_cases=sum(cases))
## # A tibble: 1 x 1
##   total_covid_cases
##               <dbl>
## 1                NA

The answer is “NA”, which stands for “Not Available”

This is a good example of how R deals with missing data

There are several options for dealing with missing values in R

  1. Complete case analysis

    • Remove rows with any missing data
full_dataset <- na.omit(analysis_dataset)
  1. Exclude “NA” values from calculations

    • Add an additional argument to the function to remove “NA”
analysis_dataset %>% 
    summarise(total_covid_cases=sum(cases, na.rm=TRUE))
## # A tibble: 1 x 1
##   total_covid_cases
##               <dbl>
## 1           4561465

This command has now excluded NA values and has provided us with an answer for the number of confirmed COVID-19 cases in Africa - 4,561,465

How many confirmed cases of COVID-19 have been recorded in Africa, by region?

analysis_dataset %>% 
  group_by(region) %>% 
    summarise(total_covid_cases=sum(cases, na.rm=TRUE))
## # A tibble: 5 x 2
##   region          total_covid_cases
##   <chr>                       <dbl>
## 1 Central Africa             161353
## 2 Eastern Africa             622537
## 3 Northern Africa           1371469
## 4 Southern Africa           1970137
## 5 Western Africa             435969

group_by is a very powerful function for summarising data

analysis_dataset %>% 
  group_by(region) %>% 
    summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>% 
  arrange(-total_covid_cases)
## # A tibble: 5 x 2
##   region          total_covid_cases
##   <chr>                       <dbl>
## 1 Southern Africa           1970137
## 2 Northern Africa           1371469
## 3 Eastern Africa             622537
## 4 Western Africa             435969
## 5 Central Africa             161353

The arrange function can be used to organise the results. In this case we have instructed R to sort the results by the total_covid_cases variable, from highest to lowest value.

We can add multiple variables to group_by

If we add region and country to the group_by command, sort from highest to lowest, we can see which countries reported the most confirmed COVID-19 cases

analysis_dataset %>% 
  group_by(region, country) %>% 
    summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>% 
  arrange(-total_covid_cases)
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 53 x 3
## # Groups:   region [5]
##    region          country      total_covid_cases
##    <chr>           <chr>                    <dbl>
##  1 Southern Africa South Africa           1584064
##  2 Northern Africa Morocco                 511856
##  3 Northern Africa Tunisia                 311743
##  4 Eastern Africa  Ethiopia                258353
##  5 Northern Africa Egypt                   228584
##  6 Northern Africa Libya                   178335
##  7 Western Africa  Nigeria                 165199
##  8 Eastern Africa  Kenya                   160559
##  9 Northern Africa Algeria                 122522
## 10 Western Africa  Ghana                    92683
## # … with 43 more rows

Another useful function is filter which can be used to apply filters to calculations

We can repeat the previous calculation, but then add a filter to only include results from countries in Northern Africa

analysis_dataset %>% 
  group_by(region, country) %>% 
    summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>% 
  arrange(-total_covid_cases) %>% 
  filter(region=="Northern Africa")
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups:   region [1]
##   region          country    total_covid_cases
##   <chr>           <chr>                  <dbl>
## 1 Northern Africa Morocco               511856
## 2 Northern Africa Tunisia               311743
## 3 Northern Africa Egypt                 228584
## 4 Northern Africa Libya                 178335
## 5 Northern Africa Algeria               122522
## 6 Northern Africa Mauritania             18429

The filter can be applied at any point within the calculation. For very complex calculations, it is helpful to apply the filter as early as possible. This reduces the number of records before the complex portion of the calculation occurs.

filter can also be used to make data frames

northern_africa <- analysis_dataset %>% 
  filter(region=="Northern Africa")

Using filters, we can answer additional questions.

When was the first confirmed case of COVID-19 in Northern Africa?

northern_africa %>% 
  filter(cases>0) %>% 
  filter(date == min(date, na.rm=TRUE)) 
## # A tibble: 1 x 4
##   date       region          country cases
##   <date>     <chr>           <chr>   <dbl>
## 1 2020-02-14 Northern Africa Egypt       1

Here we have added 2 filters:

  1. Only keep records where the value for cases is higher than 0

  2. Only keep records where the value for date is equal to the minimum value for date. We have also added the na.rm=TRUE command from a previous step. If you don’t know the data very well, it is good practice to add this command.

When was the first confirmed case of COVID-19 in Northern Africa, by country?

northern_africa %>% 
  group_by(country) %>% 
  filter(cases>0) %>% 
  filter(date == min(date, na.rm=TRUE)) 
## # A tibble: 6 x 4
## # Groups:   country [6]
##   date       region          country    cases
##   <date>     <chr>           <chr>      <dbl>
## 1 2020-02-25 Northern Africa Algeria        1
## 2 2020-02-14 Northern Africa Egypt          1
## 3 2020-03-24 Northern Africa Libya          1
## 4 2020-03-13 Northern Africa Mauritania     1
## 5 2020-03-02 Northern Africa Morocco        1
## 6 2020-03-02 Northern Africa Tunisia        1

These results can be stored in an object for future use

first_cases_northern_africa <- northern_africa %>% 
  group_by(country) %>% 
  filter(cases>0) %>% 
  filter(date == min(date, na.rm=TRUE)) 

On what date, was the 100th case of COVID-19 reported from each country in Northern Africa?

northern_africa %>% 
  group_by(country) %>% 
  mutate(cumulative_cases=cumsum(cases)) %>% 
  filter(cumulative_cases>=100) %>% 
  slice(1) %>% 
  pull(date, country)
##      Algeria        Egypt        Libya   Mauritania      Morocco      Tunisia 
## "2020-03-20" "2020-03-15" "2020-05-28" "2020-05-19" "2020-03-22" "2020-03-24"

Here we have introduced two new functions slice and pull

slice can be used to select certain rows from a dataset. In this case, we have added a column which is the cumulative number of cases, selected the first row after filtering the dataset to only include results where the value is greater than or equal to 100, and then selected the first row using the slice command.

An additional step is the pull command. This is useful when you want to extract specific values from the result.

first_100cases <- northern_africa %>% 
  group_by(country) %>% 
  mutate(cumulative_cases=cumsum(cases)) %>% 
  filter(cumulative_cases>=100) %>% 
  slice(1) %>% 
  pull(date, country)

Visualising data

One of the key strengths of R is visualising data. There are many packages which have functions you can use to make graphs, tables, maps…the list is endless!

The first package of functions we will use for visualising data is another core tidyverse package called ggplot2. This is commonly referred to as ggplot

We have already loaded the package when we ran library(tidyverse)

You can also choose to only load the ggplot2 package by typing library(ggplot2)

library(ggplot2)

The Epidemiologist R handbook has 2 sections focused on ggplot

  1. ggplot basics

  2. ggplot tips

These sections contain very helpful explanations of many of the functions available with ggplot. There are also a number of excellent references for every type of graph you want to make.

We will walk through some common examples to teach some of the most common approaches

Make a graph of confirmed COVID-19 cases in Northern Africa

ggplot(northern_africa, aes(x=date,y=cases)) +
  geom_line()
## Warning: Removed 7 row(s) containing missing values (geom_path).

This command has generated a line graph of confirmed COVID-19 cases for countries in Northern Africa.

From earlier steps, we know that the dataset northern_africa contains data from multiple countries: `r unique(northern_africa$country’

We can add more information to the ggplot command to draw separate lines for each country

ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
  geom_line()
## Warning: Removed 9 row(s) containing missing values (geom_path).

To make the graph more presentable, we can add more options to the ggplot command

ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
  geom_line() +
  labs(x='Date', y='Total cases', color='Country') + #label axes
  theme(legend.position='top') + #place legend at top of graph
  scale_x_date(date_breaks = '2 months', #set x axis to have 2 month breaks
               date_minor_breaks = '1 month', #set x axis to have 1 month breaks
               date_labels = '%d-%m-%y') #change label for x axis
## Warning: Removed 9 row(s) containing missing values (geom_path).

It is still difficult to see the data for each country. There is a helpful command to fix this

ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
  geom_line() +
  labs(x='Date', y='Total cases', color='Country') + #label axes
  theme(legend.position='none') + #remove legend by setting position to 'none'
  scale_x_date(date_breaks = '2 months', #set x axis to have 2 month breaks
               date_minor_breaks = '1 month', #set x axis to have 1 month breaks
               date_labels = '%d-%m-%y') + #change label for x axis
  facet_wrap(~country) # this will create a separate graph for each country
## Warning: Removed 9 row(s) containing missing values (geom_path).